{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from IPython.core.display import display, HTML\n",
    "display(HTML(\"<style>.container { width:100% !important; }</style>\"))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Lecture 8 - Clustering with k-Means  *\n",
    "---\n",
    "\n",
    "\n",
    "### Content\n",
    "\n",
    "1. What is unsupervised learning?\n",
    "2. Clustering with k-Means\n",
    "3. Evaluating clustering - cluster cohesion and separation\n",
    "4. Using scikit-learn for clustering\n",
    "\n",
    "\n",
    "### Learning Outcomes\n",
    "\n",
    "At the end of this lecture, you should be able to:\n",
    "\n",
    "* explain the motivation behind clustering\n",
    "* recognise the types of problems that unsupervised machine learning (clustering) is applied to \n",
    "* explain the k-means clustering algorithm\n",
    "* apply the k-means algorithm and interpret its output\n",
    "* evaluate the results of clustering\n",
    "* apply clustering from scikit-learn\n",
    "\n",
    "\n",
    "---\n",
    "\n",
    "\\* Some material on clustering is sourced from Janert, P. K. (2010). Data analysis with open source tools. O'Reilly Media, Inc"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Unsupervised Machine Learning (Clustering)\n",
    "\n",
    "So far we have learned how to use **supervised machine learning** techniques like classification and regression. \n",
    "\n",
    "In supervised learning, we new exactly what the inputs and its expected outputs were. We would provide our machine learning algorithm with the **inputs** and ask it to **learn** to uncover the underlying patterns which would allow it to **correctly map to the provided corresponding outputs**. \n",
    "\n",
    "When deploying a classifier of this type, we would subsequently provide it with inputs that it has not seen before and expect it to **produce informed outputs** (class labels or continuous values) based on what the classifier has learned before.\n",
    "\n",
    "**Clustering** is a family of **unsupervised machine learning techniques**. It operates in a fundamentally different way to classification and regression. \n",
    "\n",
    "Clustering is a **method for discovering and visualising distinct groups** of similar data points. The purpose of clustering is to **find structure** and **discover patterns** in a dataset where no one piece of data within it provides a definitive answer. \n",
    "\n",
    "Clustering is often seen as an **exploratory method** which is a computationally-driven approach to discovering structure in data. It is to a large degree subjective and requires a strong understanding of the problem domain in order to ascribe meaning to the results.  The result of clustering are always sets of clusters; however, this does not necessarily mean that the clusters are significant, possess any real meaning and thus are present in the real-world domain from which the data originated. Many times the results will be devoid of meaning or very hard to make sense of. \n",
    "\n",
    "Clustering is heavily used in data-intensive domains. Examples of usage: groups of customers with similar buying patterns can automatically be detected by retailers who track customer purchases (loyalty/reward cards) and **marketing or retail strategies** can be developed from this; identification of areas of similar land use in an **earth observation** database; in **insurance** settings, groups of motor insurance policy holders with a high average claim cost can be identified; groups of similar houses according to their house type, value, and geographical location can be identified in **city planning** scenarios; in **medicine**, clustering can be used in automatically identifying cancerous cells in datasets without any prior labelling; **computational biology** also employs clustering to discover groups of genes that exhibit similar behaviour, which might indicate that they respond to a treatment in the same way or are part of the same biological pathway.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## What is a cluster?\n",
    "\n",
    "The concept of a \"cluster\" is in fact not well defined and lacks theoretical rigour.\n",
    "\n",
    "Simply saying **\"a cluster is a set of similar points\"** is in many ways **insufficient**. The same goes for **\"a cluster is a group of points that are close together\"**. The reason that these descriptions are insufficient is because clusters **must also be well separated** from each other.\n",
    "\n",
    "![Source Janert, P. K. (2010). Data analysis with open source tools.  O'Reilly Media, Inc](../figures/cluster_uniform.jpg)\n",
    "\n",
    "Source: Janert, P. K. (2010). Data analysis with open source tools.  O'Reilly Media, Inc"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "An arguably more precise definition of what a cluster is: **\"contiguous regions of high data point density separated by regions of lower point density\".**\n",
    "\n",
    "\n",
    "![Source Janert, P. K. (2010). Data analysis with open source tools.  O'Reilly Media, Inc](../figures/cluster_smile.jpg)\n",
    "\n",
    "> Source: Janert, P. K. (2010). Data analysis with open source tools.  O'Reilly Media, Inc"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The previous definition of what regions of clusters should be opens up the possibilities for many types of very different clusters.\n",
    "\n",
    "The image below depicts regions of clusters made up of **graph-like relationships** rather than **point density**. In addition, this example depicts  scenarios where we observe **nested clusters** within other clusters. \n",
    "\n",
    "![Source Janert, P. K. (2010). Data analysis with open source tools.  O'Reilly Media, Inc](../figures/cluster_graph.jpg)\n",
    "\n",
    "> Source: Janert, P. K. (2010). Data analysis with open source tools.  O'Reilly Media, Inc\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Cluster analysis does not need to be limited to just points in space, but can involve **strings as well non-geometric data like time-series**. The challenge in these scenarious then becomes of **how we define the notion of \"similarity\".** \n",
    "\n",
    "![Source Janert, P. K. (2010). Data analysis with open source tools.  O'Reilly Media, Inc](../figures/cluster_strings_time_series.jpg)\n",
    "\n",
    "> Source: Janert, P. K. (2010). Data analysis with open source tools.  O'Reilly Media, Inc"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Final examples of clusters which are easily discernible by human eyes but difficult to train computers to do are expressed in the images below. The **intertwining of the clusters** in one dataset and the **intersection of two clusters** in the other are a particular **challenge** to clustering algorithms and not trivially solved.\n",
    "\n",
    "\n",
    "![Source Janert, P. K. (2010). Data analysis with open source tools.  O'Reilly Media, Inc](../figures/cluster_complexity.jpg)\n",
    "\n",
    "> Source: Janert, P. K. (2010). Data analysis with open source tools.  O'Reilly Media, Inc"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Example of Cluster Formation with Different Algorithms\n",
    "\n",
    "A good article and illustration of five key clustering algorithms, with their approaches to forming clusters can be found here: https://towardsdatascience.com/the-5-clustering-algorithms-data-scientists-need-to-know-a36d136ef68\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "![the-5-clustering-algorithms-data-scientists-need-to-know](https://miro.medium.com/max/960/1*KrcZK0xYgTa4qFrVr0fO2w.gif)\n",
    "\n",
    "![the-5-clustering-algorithms-data-scientists-need-to-know](https://miro.medium.com/max/552/1*bkFlVrrm4HACGfUzeBnErw.gif)\n",
    "\n",
    "![the-5-clustering-algorithms-data-scientists-need-to-know](https://miro.medium.com/max/864/1*vyz94J_76dsVToaa4VG1Zg.gif)\n",
    "\n",
    "![the-5-clustering-algorithms-data-scientists-need-to-know](https://miro.medium.com/max/1104/1*tc8UF-h0nQqUfLC8-0uInQ.gif)\n",
    "\n",
    "![the-5-clustering-algorithms-data-scientists-need-to-know](https://miro.medium.com/max/720/1*OyXgise21a23D5JCss8Tlg.gif)\n",
    "\n",
    "![the-5-clustering-algorithms-data-scientists-need-to-know](https://miro.medium.com/max/1280/1*ET8kCcPpr893vNZFs8j4xg.gif)\n",
    "\n",
    "> Source:George Seif https://towardsdatascience.com/the-5-clustering-algorithms-data-scientists-need-to-know-a36d136ef68"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Distance and Similarity Measures\n",
    "\n",
    "In the same way that we defined distance metrics in our usage of the kNN algorithm, the **clustering algorithms also require the definition of a function that returns a scalar value to denote distance or similarity between two points**. Whether the function expresses the distance or similarity between two points is a matter of choice and the domain - distance can be transformed into similarity and vice versa.\n",
    "\n",
    "The image below shows examples of some of the popular distance and similarity measures, including some we have already covered in this course. Despite its simplicity, the **Euclidean** distance metric remains one of the most widely used on numerical domains.\n",
    "\n",
    "Just as we discovered that data comprising of **different scales** can have a **negative effect** on the kNN algorithm, the same is true in clustering. Whenever we employ distance metrics like Euclidean, we **must perform normalization** before applying the clustering algorithm. Correlation-based similarity metrics are however resistant to range variability within data.\n",
    "\n",
    "\n",
    "![Source Janert, P. K. (2010). Data analysis with open source tools.  O'Reilly Media, Inc](../figures/cluster_distance_similarity.jpg)\n",
    "\n",
    "> Source: Janert, P. K. (2010). Data analysis with open source tools.  O'Reilly Media, Inc\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Clustering Algorithms\n",
    "\n",
    "There are different families of clustering algorithms. \n",
    "\n",
    "### Tree Builders\n",
    "\n",
    "Tree building clustering algorithms construct **decision tree-like structures** by successively combining clusters that are “close” to each\n",
    "other into a larger cluster until only a single cluster remains. This technique is known as\n",
    "**agglomerative hierarchical clustering**. The final cluster representation is a tree-like hierarchy of clusters. \n",
    "\n",
    "Clusters that exhibit similarity are merged early, constituting the leaves part of a tree. More dissimilar \n",
    "clusters are joined later in the process, nearer the root of the tree. \n",
    "\n",
    "The resulting structure can be represented graphically in a **dendrogram** (seen in the image below that depicts clusters of blogs based on word usage from which topics emerge). To extract actual clusters from it, we need to walk the tree, evaluate the cluster properties for each subtree,\n",
    "and then cut the tree to obtain clusters.\n",
    "\n",
    "![Source Janert, P. K. (2010). Data analysis with open source tools.  O'Reilly Media, Inc](../figures/cluster_dendrogram.jpg)\n",
    "\n",
    "Tree building clustering algorithms are very **computationally intensive**. However, their advantage is that they do not just produce a flat list of clusters, but instead **explicitly show the relationships** between the clusters that can then be interpreted for meaning.\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Centre Seekers\n",
    "\n",
    "The most widely used category of clustering algorithms is the **k-means** family. \n",
    "\n",
    "The k-means algorithm is an **iterative algorithm**. The algorithm requires at the outset that the number of **expected clusters k be specified** as input to it. The key principle behind the algorithm is to **repeatedly re-calculate the centre or the centroid of each cluster**, and re-assign all the points to each centroid depending on their distance from it. This process is repeated until a predefined convergence is achieved or the maximum number of iterations are carried out.\n",
    "\n",
    "\n",
    "![Segaran, T. (2007). Programming collective intelligence: building smart web 2.0 applications.  O'Reilly Media, Inc.](../figures/cluster_kmeans.jpg)\n",
    "\n",
    "\n",
    "> Image Source: Segaran, T. (2007). Programming collective intelligence: building smart web 2.0 applications. \" O'Reilly Media, Inc.\"."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import numpy as np\n",
    "import matplotlib as mtpl\n",
    "import math \n",
    "import matplotlib.pyplot as plt\n",
    "import random\n",
    "from sklearn import preprocessing\n",
    "\n",
    "%matplotlib inline\n",
    "\n",
    "df = pd.io.parsers.read_csv(\n",
    "    '../datasets/wine_data.csv',\n",
    "     usecols=[0,6,7]\n",
    "    )\n",
    "\n",
    "df.columns=['Class','Magnesium','Flavanoids']\n",
    "df.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": false
   },
   "outputs": [],
   "source": [
    "fig, axes = plt.subplots()\n",
    "fig.set_size_inches(10,10)\n",
    "\n",
    "axes.grid()\n",
    "axes.set_xlabel('Magnesium',fontsize=30)\n",
    "axes.set_ylabel('Flavanoids',fontsize=30)\n",
    "plt.title('Wine Dataset',fontsize=30)\n",
    "axes.scatter(df.Magnesium, df.Flavanoids, s=90, alpha=0.6, c='red')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We first need to normalise the data:\n",
    "\n",
    "**Exercise:** Re-scale the 'Magnesium' and 'Flavanoids' values using MinMax.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "fig, axes = plt.subplots()\n",
    "fig.set_size_inches(10,10)\n",
    "\n",
    "axes.grid()\n",
    "axes.set_xlabel('Magnesium',fontsize=30)\n",
    "axes.set_ylabel('Flavanoids',fontsize=30)\n",
    "plt.title('Wine Dataset',fontsize=30)\n",
    "axes.scatter(df.Magnesium, df.Flavanoids, s=90, alpha=0.6, c='red')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The cetroid is critical to understanding k-means. It is the average position of all the points in a set of data instances. In our often n-dimensional spaces, the centroid is the mean position of all the points in all of the coordinate directions i.e. it is the mean of each feature type.\n",
    "\n",
    "**Exercise 1:** Find the centroid for all the data points in the above graph and plot it as a black dot in the graph:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "fig, axes = plt.subplots()\n",
    "fig.set_size_inches(10,10)\n",
    "axes.set_xlabel('Magnesium',fontsize=30)\n",
    "axes.set_ylabel('Flavanoids',fontsize=30)\n",
    "plt.title('Wine Dataset',fontsize=30)\n",
    "\n",
    "# YOUR CODE HERE\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The basic algorithm can be summarized as follows: \n",
    "\n",
    "1. Randomly pick **k** centroids (or points that will be the center of your clusters) in multi dimensional d-space. Strive to make them near the data but away from each another.\n",
    "2. Assign each data point to the closest centroid.\n",
    "3. Move the centroids to the average location of the data points assigned to it.\n",
    "4. Repeat the preceding two steps until the assignments no longer change - or change very little."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "When randomly selecting the initial centroid(s) for the kmeans algorithm, it is important that the randomly chosen centroid is within the data range for each of the feature types.\n",
    "\n",
    "**Exercise 2:**  Using the data above, randomly select a centroid that is within the range of values for Flavanoids and Magnesium and plot it as a black dot. (Use random.uniform(x_min,x_max) to generate a random number in the range x_min amd x_max)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "fig, axes = plt.subplots()\n",
    "fig.set_size_inches(10,10)\n",
    "\n",
    "axes.grid()\n",
    "axes.set_xlabel('Magnesium',fontsize=30)\n",
    "axes.set_ylabel('Flavanoids',fontsize=30)\n",
    "plt.title('Wine Dataset',fontsize=30)\n",
    "axes.scatter(df.Magnesium, df.Flavanoids, s=90, alpha=0.6, c='red')\n",
    "\n",
    "# YOUR CODE HERE\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Exercise 3:** Given two centroids ([0.1,0.6] and [0.3,0.6]) in the graph below, calculate the closest points to each of the centroids using the Euclidean distance defined below. Then, calculate the centroid of each of the samples assigned to each of the centroids. Make the centroids of each of the two groups of points the new centroid, and plot"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def eucledean_distance(x, y):\n",
    "    d = 0.0\n",
    "    \n",
    "    for i in range(len(x)):\n",
    "        d += (x[i] - y[i])**2\n",
    "    d = math.sqrt(d)\n",
    "    \n",
    "    return d"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "fig, axes = plt.subplots()\n",
    "fig.set_size_inches(10,10)\n",
    "axes.set_xlabel('Magnesium',fontsize=30)\n",
    "axes.set_ylabel('Flavanoids',fontsize=30)\n",
    "plt.title('Wine Dataset',fontsize=30)\n",
    "axes.scatter(df.Magnesium, df.Flavanoids, s=90, alpha=0.6, c='red')\n",
    "axes.scatter(0.1, 0.6, s=150, alpha=1, c='black')\n",
    "axes.scatter(0.3, 0.6, s=150, alpha=1, c='black')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "fig, axes = plt.subplots()\n",
    "fig.set_size_inches(10,10)\n",
    "\n",
    "axes.grid()\n",
    "axes.set_xlabel('Magnesium',fontsize=30)\n",
    "axes.set_ylabel('Flavanoids',fontsize=30)\n",
    "plt.title('Wine Dataset',fontsize=30)\n",
    "axes.scatter(df.Magnesium, df.Flavanoids, s=90, alpha=0.6, c='red')\n",
    "\n",
    "counts = [0,0]\n",
    "centroids = [[0.1, 0.6], [0.3, 0.6]] # the starting centroids\n",
    "centroid_coordinates = [[0,0],[0,0]] # the updated centroids\n",
    "\n",
    "rows = df[['Magnesium','Flavanoids']].values\n",
    "\n",
    "for j in range(len(rows)):\n",
    "    row = rows[j]\n",
    "    best_match = 0\n",
    "    for i in range(2):\n",
    "       # YOUR CODE HERE\n",
    "       \n",
    "\n",
    "try:\n",
    "    centroid_coordinates[0][0] /= counts[0]\n",
    "    centroid_coordinates[0][1] /= counts[0]\n",
    "except:\n",
    "    centroid_coordinates[0] = centroids[0]\n",
    "\n",
    "try:\n",
    "    centroid_coordinates[1][0]  /= counts[1]\n",
    "    centroid_coordinates[1][1] /= counts[1]\n",
    "except:\n",
    "    centroid_coordinates[1] = centroids[1]\n",
    "\n",
    "\n",
    "axes.scatter(centroid_coordinates[0][0], centroid_coordinates[0][1], s=150, alpha=1, c='black')\n",
    "axes.scatter(centroid_coordinates[1][0], centroid_coordinates[1][1], s=150, alpha=1, c='black')\n",
    "print(centroids)\n",
    "print(centroid_coordinates)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The k-means algorithm is **nondeterministic**. This means that different starting values may produce very **different results**. Given this, it is expected that the algorithm be **run multiple times** with different randomly selected starting values and to then compare results."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**k-means cannot be used for categorical data** unless the distance/similarity function is redefined for categorical data. One option is to use Jaccard similarity:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def jaccard_similarity(x,y):\n",
    "    x = set(x)\n",
    "    y = set(y)\n",
    "    intersection = len(set.intersection(x,y))\n",
    "    union = len(set.union(x, y))\n",
    "    \n",
    "    return intersection / float(union)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "1 - jaccard_similarity(['apples','bananas','oranges','kiwi','chrries'],['apples','bananas','oranges','grapefruit','grapes'])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "1 - jaccard_similarity(['apples','bananas','oranges','kiwi','chrries'],['apples','berries','pears','grapefruit','grapes'])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def categorical_variable_distance(s1, s2):\n",
    "    if len(s1) != len(s2):\n",
    "        raise ValueError(\"Sequences are of unequal length\")\n",
    "    number_of_matches = sum(ch1 == ch2 for ch1, ch2 in zip(s1, s2))\n",
    "    \n",
    "    return (number_of_matches) / float(len(s1))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "1 - categorical_variable_distance(['apples','bananas','oranges','kiwi','chrries'],['apples','bananas','oranges','grapefruit','grapes'])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "1 - categorical_variable_distance(['apples','bananas','oranges','kiwi','chrries'],['apples','berries','pears','grapefruit','grapes'])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Issues with k-means:\n",
    "\n",
    "- **Choosing k is more an art than a science**, although there are bounds: 1≤k ≤n, where n is number of data points.\n",
    "\n",
    "- There are **convergence issues** — the solution can fail to exist, if the algorithm falls into a loop, for example, and keeps going back and forth between two possible solutions, or in other words, there isn’t a single unique solution.\n",
    "\n",
    "- **Interpretability** can be a problem—sometimes the answer isn’t at all useful. Indeed that’s often the biggest problem. In spite of these issues, it’s pretty fast (compared to other clustering algorithms), and there are broad applications in marketing, computer vision (partitioning an image), or as a starting point for other models\n",
    "\n",
    "> Schutt, R., & O'Neil, C. (2013). Doing Data Science: Straight Talk from the Frontline. \" O'Reilly Media, Inc.\".\n",
    "\n",
    "Below is adapted python code for kmeans from:\n",
    "\n",
    "> Segaran, T. (2007). Programming collective intelligence: building smart web 2.0 applications. \" O'Reilly Media, Inc.\"."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "fig, axes = plt.subplots()\n",
    "fig.set_size_inches(10,10)\n",
    "\n",
    "axes.set_xlabel('Magnesium',fontsize=30)\n",
    "axes.set_ylabel('Flavanoids',fontsize=30)\n",
    "plt.title('Wine Dataset',fontsize=30)\n",
    "axes.grid()\n",
    "axes.scatter(df.Magnesium.where(df.Class == 1), df.Flavanoids.where(df.Class == 1), s=90, alpha=0.6, c='red')\n",
    "axes.scatter(df.Magnesium.where(df.Class == 2), df.Flavanoids.where(df.Class == 2), s=90, alpha=0.6, c='blue')\n",
    "axes.scatter(df.Magnesium.where(df.Class == 3), df.Flavanoids.where(df.Class == 3), s=90, alpha=0.6, c='green')\n",
    "\n",
    "\n",
    "axes.scatter(0,0.9, s=150, alpha=1, c='black')\n",
    "axes.scatter(0.1,0.6, s=150, alpha=1, c='black')\n",
    "axes.scatter(0, 0.3, s=150, alpha=1, c='black')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def kmeans_cluster(rows, distance = eucledean_distance, k = 3, iter = 10):\n",
    "    \n",
    "    # Determine the minimum and maximum values for each point\n",
    "    ranges=[(min([row[i] for row in rows]),max([row[i] for row in rows])) for i in range(len(rows[0]))]\n",
    "    print(ranges)\n",
    "    \n",
    "    # Create k randomly placed centroids\n",
    "    #centroids=[[random.random( ) * (ranges[i][1] - ranges[i][0]) + ranges[i][0]  for i in range(len(rows[0]))] for j in range(k)]\n",
    "    centroids=[[0,0.9],[0.1,0.6],[0, 0.3]]\n",
    "    print(centroids)\n",
    "    \n",
    "    prev_cluster_labels = None\n",
    "    for t in range(iter):\n",
    "        #print('Iteration %d' % t)\n",
    "        cluster_labels = [[] for i in range(k)]\n",
    "        #print(cluster_labels)\n",
    "        # Find which centroid is the closest for each row\n",
    "        for j in range(len(rows)):\n",
    "            row = rows[j]\n",
    "            best_match = 0\n",
    "            for i in range(k):\n",
    "                d = distance(centroids[i],row)\n",
    "                if d < distance(centroids[best_match],row): \n",
    "                    best_match=i\n",
    "            cluster_labels[best_match].append(j)\n",
    "            \n",
    "        # If the results are the same as last time, this is complete\n",
    "        if cluster_labels == prev_cluster_labels:\n",
    "            break\n",
    "        prev_cluster_labels = cluster_labels\n",
    "    \n",
    "        # Move the centroids to the average of their members\n",
    "        for i in range(k):\n",
    "            avgs = [0.0] * len(rows[0])\n",
    "            if len(cluster_labels[i]) > 0:\n",
    "                for rowid in cluster_labels[i]:\n",
    "                    for m in range(len(rows[rowid])):\n",
    "                        avgs[m] += rows[rowid][m]\n",
    "                for j in range(len(avgs)):\n",
    "                    avgs[j] /= len(cluster_labels[i])\n",
    "                centroids[i] = avgs\n",
    "            \n",
    "    return cluster_labels, centroids"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "labels, centroids = kmeans_cluster(rows=df[['Magnesium','Flavanoids']].values,k = 3, iter = 3)\n",
    "#print labels, centroids\n",
    "fig, axes = plt.subplots()\n",
    "fig.set_size_inches(10,10)\n",
    "\n",
    "axes.set_xlabel('Magnesium',fontsize=30)\n",
    "axes.set_ylabel('Flavanoids',fontsize=30)\n",
    "plt.title('Wine Dataset',fontsize=30)\n",
    "axes.grid()\n",
    "\n",
    "axes.scatter(df.Magnesium.where(df.Class == 1), df.Flavanoids.where(df.Class == 1), s=90, alpha=0.6, c='red')\n",
    "axes.scatter(df.Magnesium.where(df.Class == 2), df.Flavanoids.where(df.Class == 2), s=90, alpha=0.6, c='blue')\n",
    "axes.scatter(df.Magnesium.where(df.Class == 3), df.Flavanoids.where(df.Class == 3), s=90, alpha=0.6, c='green')\n",
    "\n",
    "\n",
    "axes.scatter(centroids[0][0],centroids[0][1], s=150, alpha=1, c='black')\n",
    "axes.scatter(centroids[1][0], centroids[1][1], s=150, alpha=1, c='black')\n",
    "axes.scatter(centroids[2][0], centroids[2][1], s=150, alpha=1, c='black')\n",
    "\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Exercise:** Evaluate the classification accuracy of the kmeans clustering above. print out the evaluation summary for each class."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Evaluating clustering\n",
    "\n",
    "#### Cohesion and Separation\n",
    "\n",
    "We can evaluate each cluster based and its **cohesion** value. Cohesion describes the **closeness of the points within a cluster**. It is **average distance** of all the points in the cluster to the cluster centroid.\n",
    " \n",
    "\n",
    "**Separation** defines the average distance between a cluster and all the points outside of the cluster.\n",
    "\n",
    "Given the definition of the cohesion and separation for a single cluster, this can easily be extended to produce a measure of the overall clustering for a given problem. We can use the average cohesion and average separation (preferably weighted by each cluster mass which is the number of samples in each cluster) to give us global metrics for evaluating a clustering solution.  \n",
    " \n",
    "![.](../figures/cluster_cohesion_separation.jpg)\n",
    "\n",
    "\n",
    "Once we have calculated the global cohesion and separation values for each cluster, we can combine them into a single value that represents the ratio:\n",
    "\n",
    "$\\frac{separation}{cohesion}$\n",
    "\n",
    "where we expect the ratio to increase as the separability and the cohesion of a clustering solution improve.\n",
    "\n",
    "> Janert, P. K. (2010). Data analysis with open source tools. O'Reilly Media, Inc\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Exercise 4:** The kmeans_cluster cluster function above returned a list of labels (indexes) of all samples assigned to each cluster as well as a list of centroids for each cluster. Calculate the cohesion for cluster [0]."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "labels[0]\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "centroids"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "rows = df[['Magnesium','Flavanoids']].values\n",
    "\n",
    "dist = 0.0\n",
    "mass = 0\n",
    "\n",
    "#YOUR CODE HERE\n",
    "for row in labels[0]:\n",
    "\n",
    "\n",
    "dist / float(mass)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Exercise 5:**  Calculate the separation of cluster [0] to elements in cluster [1]."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sep = 0.0\n",
    "num = 0\n",
    "\n",
    "#YOUR CODE HERE\n",
    "for row in labels[1]:\n",
    "\n",
    "    \n",
    "sep / float(num)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Exercise 6:**  Calculate the separation of cluster [0] to all clusters."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "total_sep = 0.0\n",
    "num = 0\n",
    "\n",
    "#YOUR CODE HERE\n",
    "for c in range(1,3):\n",
    "\n",
    "\n",
    "total_sep / float(num)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Warning\n",
    "\n",
    "- Clustering is fun and fascinating but it can also lead you astray and become a waste of time. Obtaining useful results is often difficult. Interpretation is a challenge.\n",
    "\n",
    "- We tend to assume that our dataset has clusters, or a certain number, and this may prove to be wrong assumption.\n",
    "\n",
    "- Assuming you have found clusters and are able to draw out some meaning from them, the field is still lacking rigorous tools to evaluate the findings and even to know exactly what to do with them.\n",
    "\n",
    "- A good starting point with clustering is to formulate the question you are wanting to answer at the outset, together with hypotheses and then use the data to validate or disprove them.\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### To summarize:\n",
    "\n",
    "• The k-means algorithms and its variants work best for globular (at least star-convex)\n",
    "clusters. The results will be meaningless for clusters with complicated shapes and for\n",
    "nested clusters.\n",
    "\n",
    "• The expected number of clusters is required as an input. If this number is not known, it\n",
    "will be necessary to repeat the algorithm with different values and compare the results.\n",
    "\n",
    "• The algorithm is iterative and nondeterministic; the specific outcome may depend on\n",
    "the choice of starting values.\n",
    "\n",
    "• The k-means algorithm requires vector data; use a different distance method for categorical data\n",
    "\n",
    "• The algorithm can be misled if there are clusters of highly different size or different\n",
    "density.\n",
    "\n",
    "• The k-means algorithm is linear in the number of data points; the k-medoids algorithm\n",
    "is quadratic in the number of points.\n",
    "\n",
    "> Janert, P. K. (2010). Data analysis with open source tools. O'Reilly Media, Inc"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Clustering Using scikit-learn"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.cluster import KMeans\n",
    "from sklearn import preprocessing"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "km = KMeans(n_clusters=3, init='random')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "km.fit(df[['Magnesium','Flavanoids']].values)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "predictions = km.predict(df[['Magnesium','Flavanoids']].values)\n",
    "predictions"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Another useful measure that provides a useful global figure that expresses how well the clustering has performed is the **Silhouette Coefficient (SC)**.\n",
    "\n",
    "The SC is calculated for each point as follows:\n",
    "\n",
    "a = average distance to all other points in its cluster\n",
    "\n",
    "b = average distance to all other points in the next nearest cluster\n",
    "\n",
    "SC = (b-a)/max(a, b)\n",
    "\n",
    "SC is in the range from -1 (worst) to 1 (best).\n",
    "\n",
    "A global SC is calculated by taking the average of the SC for all points."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn import metrics\n",
    "metrics.silhouette_score(df[['Magnesium','Flavanoids']].values, predictions)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Since k-means essentially attempts to minimise the within-cluster sum of squares (WCSS), we can experimentally visualise this property for a range of k values in order to ascertain what is the ideal number of clusters - aiming for the smallest possible and the lowest WCSS."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "k_rng = range(1,10)\n",
    "est = [KMeans(n_clusters = k).fit(df[['Magnesium','Flavanoids']].values) for k in k_rng]\n",
    "\n",
    "# Generally want to minimize WSS, while also minimizing k\n",
    "within_cluster_sum_squares = [e.inertia_ for e in est]\n",
    "fig, axes = plt.subplots()\n",
    "fig.set_size_inches(15,20)\n",
    "# Plot the results\n",
    "plt.subplot(212)\n",
    "plt.plot(k_rng, within_cluster_sum_squares, 'b*-')\n",
    "plt.xlim([1,10])\n",
    "plt.grid(True)\n",
    "plt.xlabel('k', fontsize=20)\n",
    "plt.ylabel('Within Cluster Sum of Squares', fontsize=20)\n",
    "plt.title('Within Cluster Sum of Squares versus number of Clusters', fontsize=20)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Exercise:**\n",
    "\n",
    "Given below is a dataset sourced from https://raw.githubusercontent.com/justmarkham/DAT7/master/data/beer.txt which contains information on properties of some of the most popular beer brands in the US:\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "beer_df = pd.read_csv('../datasets/beer.csv')\n",
    "beer_df\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Your task it to perform clustering on this dataset and evaluate the meaning of the clusters. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "celltoolbar": "Raw Cell Format",
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.3"
  },
  "toc": {
   "base_numbering": 1,
   "nav_menu": {},
   "number_sections": true,
   "sideBar": true,
   "skip_h1_title": false,
   "title_cell": "Table of Contents",
   "title_sidebar": "Contents",
   "toc_cell": false,
   "toc_position": {},
   "toc_section_display": true,
   "toc_window_display": false
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}
